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A METHOD FOR UTILIZING TEMPORAL MASKING IN DIGITAL AUDIO CODING 

Inventor: Wan-Chieh Pal 

BACKGROUND OF THE INVENTION 

FIELD OF THE INVENTION 

The present invention relates generally to the field of digital audio and more 
specifically, to the field of perceptual coding of digital audio. 

BACKGROUND 

Perceptual coders analyze the frequency and amplitude content of an input 
signal and compare it to a model of human auditory perception. Using the model, the 
encoder removes the irrelevancy of the audio signal. In theory, although the method is 
lossy, the human perceiver will not hear degradation in the decoded signal. 
Considerable data reduction is possible. A well-designed perceptually coded recording, 
with a conservative level of reduction, can rival the sound quality of a conventional 
recording because the data is coded in a much more intelligent fashion, and because 
the listener doesn't hear all of what is recorded to begin with. In other words, 
perceptual coders require only a fraction of the data needed by a conventional system. 

Data reduction coders attempt to represent the audio signal at a reduced bit rate 
while minimizing quantization error. Time-domain coding methods such as delta 
modulation can be considered to be data-reduction coders. They use prediction 
methods on samples representing the full bandwidth of the audio signal and yield a 
quantization error spectrum that spans the audio band. Frequency-domain encoders 
take a different approach. The signal is analyzed in the frequency domain and coded 
so that quantization error can be assigned and masked based on psychoacoustic 
characteristics of the ear. However, coder complexity is greatly increased. 
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Most low-bit-rate codecs use psychoacoustic models to adaptively quantize only 
the perceptually significant parts of the signal. Parts of the signal that are below the 
minimum threhold, or masked by more significant signals, are judged to be inaudible 
and are not coded. 

Amplitude masking occurs when a tone shifts the threshold curve upward in a 
frequency region surrounding the tone. The masking threshold describes the level 
where a tone is barely audible. When tones are sounded simultaneously, masking 
occurs in which louder tones can completely obscure softer tones. For example, a tone 
of 500 Hz can mask a concurrent softer tone of 600 Hz. The strong sound is called the 
masker and the softer sound is called the maskee. Masking theory argues that the 
softer tone is just detectable when its energy equals the energy of the part of the louder 
masking signal in the critical band; this is a linear relationship with respect to amplitude. 
Generally, depending on relative amplitude, soft (but otherwise audible) audio tones 
are masked by louder tones at a similar frequency (within 100 Hz at low frequencies). 

Temporal masking occurs when tones are sounded close in time, but not 
simultaneously. A signal can be masked by a noise or another signal that occurs later. 
This premasking is sometimes called backward masking. In addition, a signal can be 
masked by a noise or another signal that ends before the signal begins. This is post 
masking, sometimes called forward masking. In other words, a louder tone appearing 
just before (pre-masking), or after (post masking) a softer tone overcomes the softer 
tone. Just as simultaneous masking increases as frequency differences are reduced, 
temporal masking increases as time differences are reduced. 

Temporal masking decreases as the duration of the masker decreases. In 
addition, a tone is post masked by an earlier tone when they are close in frequency or 
when the earlier tone is lower in frequency. Post masking is slight when the masker 
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has a higher frequency. Logically, simultaneous masking is stronger than either pre- or 
post masking because the sounds occur at the same time. 

Temporal masking is important in frequency domain coding. These coders have 
limited time resolution because they operate on blocks of samples, thus spreading error 
over time. Temporal masking can overcome audibility of artifacts caused by transient 
signals. Ideally, filter banks should provide a time resolution of 2 to 4 ms. Acting 
together, amplitude and temporal masking form a contour that can be mapped in the 
time-frequency domain. 

In subband coding, blocks of consecutive time-domain samples representing the 
broadband signal are collected over a short period and applied to a digital filter bank. 
The filter bank divides the signal into multiple bandlimited channels to approximate the 
critical band response of the human ear. 

Each subband is coded independently with greater or fewer bits allocated to the 
samples in the subband. In any case, quantization noise is increased in each subband. 
However, when the signal is reconstructed, the quantization noise in a subband will be 
limited to that subband, where it is masked by the audio signal in each subband. Bit 
allocation is determined by a psychoacoustic model and analysis of the signal itself. 
These operations are recalculated for every subband in every new block of data. 
Samples are dynamically quantized according to audibility of signals, and noise. There 
is great flexibility in the psychoacoustic models and bit allocation algorithms used in 
coders that are othen/vise compatible. The decoder uses the quantized data to re-form 
the samples in each block. An inverse synthesis filter bank sums the subband signals 
to reconstruct the output broadband signal. 
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A subband perceptual coder uses a digital filter bank to split a short duration of 
the audio signal into multiple bands. In some designs, a side-chain processor applies 
the signal to a transform such as an FFT to analyze the energy in each subband. 
These values are applied to a psychoacoustic model to determine the combined 
masking curve that applies to the signals in that block. This permits more optimal 
coding of the time-domain samples. Specifically, the encoder analyzes the energy in 
each subband to determine which subbands contain audible information. A calculation 
is made to determine the average power level of each subband over the block. This 
average level is used to calculate the masking level due to masking of signals in each 
subband, as well as masking from signals in adjacent subbands. Finally, minimum 
hearing threshold values are applied to each subband to derive its final masking level. 
Peak power levels present in each subband are calculated and compared to the 
masking level. Subbands that do not contain audible information are not coded and in 
some cases entire subbands can mask nearby subbands which thus need not be 
coded. 



SUMMARY OF THE INVENTION 



The present invention comprises a method incorporating the use of a filter which 
accepts simultaneous masking signals and generates a close replica of temporal 
masking signals derived from the input simultaneous masking signals. The filter output 
is then added to the filter input to provide a composite masking signal. This composite 
masking signal may then be used to establish overall masking threshold levels which 
can be mapped in the appropriate subband to significantly reduce the amount of coding 
quantization required without significantly affecting the perceived sound of the 
reconstructed broadband signal. 
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In a preferred embodiment of the present invention, storage and computation 
usage are reduced by: (1) Employing such filtering for only about the lower two-thirds of 
the subbands; (2) using a second order auto-regressive and a second order moving 
average filter characteristic. The transfer function of the resulting filter may then be 
represented as: 

0.256Z"'' + 0.059z"^ 

H(z) = 

1 -0.39Z"'' -0.295z"^ 

And its impulse response as: 

H(z)= 0.2224 (0.7721)" n (n) + 0.0336 (-.3821)" (n) 

The filter's transfer function and impulse response define a filter the output of 
which exhibits two principal characteristics of temporal masking. One such 
characteristic is decay with the logarithm of time. The other is a rate of decay that is 
inversely proportional to the duration of the corresponding simultaneous masking. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The aforementioned objects and advantages of the present invention, as well as 
additional objects and advantages thereof, will be more fully understood hereinafter as 
a result of a detailed description of a preferred embodiment when taken in conjunction 
with the following drawings in which: 

FIG. 1 is a graphical illustration of simultaneous and temporal masking; 

FIG. 2 is a graphical illustration of temporal masking decay showing its linearity 
with time in log; 
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FIG. 3 is a graphical comparison of decay in an ideal filter and in a regular MR 

filter; 

FIG. 4 is a graphical comparison of performance between an ideal filter and 3-2 
ordered ARMA MR filter; 

FIG. 5 is a graphical comparison of performance between an ideal filter and a 2- 
2 ordered ARMA MR filter; 

FIGs. 6A, 6B and 6C illustrate in flowchart form the method of the present 
invention. 

While the invention is susceptible to various modifications and alternative forms, 
specific embodiments thereof have been shown by way of example in the drawings and 
will herein be described in detail. It should be understood, however, that the detailed 
description is not intended to limit the invention to the particular forms disclosed. On 
the contrary, the intention is to coverall modifications, equivalents, and alternatives 
falling within the spirit and scope of the invention as defined by the appended claims. 
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DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT 

FIG. 1 shows the basic principles indicating how masking thresholds are formed 
where simultaneous and temporal masking are caused by two different maskers. As 
shown in FIG. 1 , forward masking thresholds decay with time from the simultaneous 
masking threshold caused by the same masker. In addition, the longer the masker 
lasts, the slower its fonA/ard masking threshold decays. As soon as the masker signal 
ends, the temporal masking thresholds starts out with the same magnitude of 
simultaneous masking threshold, and decays with time. Temporal masking effect not 
only exists in the frequency bands with the same frequency components, but it also 
affects all of the bands affected by simultaneous masking. 

Several factors affect the amount of forward masking: (1) Time difference from 
the ending edge of masker; masking decays exponentially in log time; (2) duration of 
masker; the longer the masker is, the slower the masking decays; (3) frequency relative 
to the masker; the way that masking decays is different for on-frequency, higher 
frequency and lower frequency bands; (4) absolute frequency of the masker* masking is 
more effective in medium frequency bands (around 1000 Hz) than in high and low 
frequency bands; (5) power of masker; masking caused by a stronger masker decays 
faster; and (6) structure of the spectrum; decay of masking is faster if the masker is 
accompanied by other flanking signals in its neighboring bands. FIG. 2 illustrates the 
first two principles. In order to reduce computation for temporal masking, only these 
two factors are utilized. 

The temporal masking mechanism of the present invention is embodied on a 
MPEG layer-2 encoding software which adopts psychoacoustical model one to 
determine simultaneous masking. This model breaks the whole spectrum into 127 
bark-scaled subbands and computes a masking threshold for each subband. In the 
computation of the thresholds, the spectrum is simplified, thus no detail information can 
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be derived directly from the spectrum. As a result, the calculated simultaneous 
masking threshold is the only thing that can be used as input information into the filter 
to compute forward masking. 

There are several issues to consider in designing this filter. First, the temporal 
masking can last for more than 180 msec. That is longer than 7 frames when a 48k 
sampling frequency is used. In order to account for the influence for such a long 
duration, a finite impulse response (FIR) filter needs to have the simultaneous masking 
thresholds for at least 7 previous frames. That is, 

7[audio frames] x 127[sub-bands] x 2[channels] = 1778 extra double variables needed 

To reduce the storage need, an infinite impulse response (MR) filter is used. 
Second, the ordinary MR filters (if they are stable) have the following form of outputs 

M 

y{n) = 2^,(2,)'' 

/=1 

where m is the order of the MR filter, and zi, 1=1, ...,m, are poles of the MR filter, and Zi 
have absolute values smaller than 1. 

According to the above equation, the output, y(n), decays exponentially with 
linear time, not with the logarithm of time as temporal masking thresholds act. To 
correct this discrepancy, the decay is pushed closer to decaying with the logarithm of 
time. FIG. 3 illustrates this problem: The three solid lines, from top to bottom, are the 
output signals from a regular MR filter when the inputs are three, two, and one 
consecutive pulses. The three dashed lines are the corresponding desired outputs 
from an ideal filter. There are two major differences between the two sets of curves: 
One is that a linear decay rate is desirable with the logarithm of time, not with time itself; 
the other is that the decay rate for the output with shorter input is faster than the output 
with a longer input. 
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This problem is solved by the invention by making the output behave 
approximately ideally for at least the first several time frames after the temporal masker. 
After the first several frames, the temporal masking thresholds become less significant 
and are usually exceeded by simultaneous masking. Without any limitation on memory 
usage, the higher the filter order, the closer the realized decay curve can come to the 
ideal one. In terms of storage space, if the MR filter equation is: 

M L 

yin) = X a,x{n - /) + ^ ^/3^(^ " 

/=1 7=1 

and filtering is done for the lower 80 subbands (instead of 127), then the extra storage 
space needed is: 

(M + L - 1) X 80 X 2 = 160(M + L - 1) 

If a third order AR (auto-regressive) is attempted with a second order MA 
(moving average) filter, then 640 extra variables are needed, and after careful selection 
of filter coefficients, the following equation and the decay behavior in FIG. 4 are 
obtained: 

0.2504Z"'' + 0.0736z'^ 

H(z)= 

1 -0.39Z"'' -0.295z"^ 

According to FIG. 4, within 5 time frames from the masker, the temporal masking 
behavior is approximated by the 3-2 ordered ARMA filter. After 5 time frames, the 3-2 
ARMA filter usually under-estimates the temporal masking effect, although these are 
tolerable. If one wants to further reduce storage and computation usage in the process, 
one can simplify the above 3-2 ordered filter to a 2-2 ordered ARMA filter which uses 
480 extra double variables. The transfer function with optimal parameters is: 
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And its impulse response is: 

h(n) = 0.2224(0.7721 )"u(n) + 0.0336(-0.3821)"u(n) 

FIG. 5 compares the filter responses of this 2-2 filter and an ideal filter. 
Compared to the 3-2 ordered filter, it can be seen from this figure that there is more 
deviation in the 2-2 ordered filter response at the first several frames. The test result 
shows that there is no major degradation in performance from the 3-2 filter to the 2-2 
filter. 

There is one more issue in designing this temporal masking mechanism. After 
computing the temporal masking thresholds for different frequency bands, those results 
must be incorporated with the simultaneous masking thresholds. Some existing 
systems compare the two and pick up the maximum, while some add the two 
thresholds together. The preferred embodiment of the present invention shows that the 
encoding quality is better when the two thresholds are added to form the composite 
masking thresholds. 

FIGs. 6A, 6B and 6C illustrate in flowchart form the above-described method of 
the invention. At step 610, a filter is provided, the filter having an identified transfer 
function. At step 620, simultaneous masking filters are input into the provided filter. At 
step 630, an approximate replica of appropriate temporal masking filters is generated at 
the filter output. A composite masking signal is then formed, at step 640, by adding 
simultaneous masking signals and replica temporal masking signals. At step 650, a 
masking threshold level is established using the generated composite masking signal. 
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Next, the series of iterative steps illustrated as either step 655 or step 665 is executed 
At step 655, as illustrated in FIG. 6A, the code is quantized in a plurality of frequency 
domain subbands, and each of steps 610-650 is performed for each subband. In the 
alternative, at step 665 as illustrated in FIG. 6B, the code is quantized in a plurality of 
sequential time frames and each of steps 610-650 is performed for each time frame. 

Having thus described a preferred embodiment of the method of the present 
invention, it being understood that other embodiments are contemplated, what is 
claimed is: 
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